Overview

The goal of this workshop is to introduce you to some basic quality control and analysis of single-cell RNA-seq data. Importantly, this workshop does not go through clustering nor cell-type annotation as it is not exhaustive, but will hopefully help build foundational skills necessary for analysing transcriptomic data. Note: What you learn in this workshop is also applicable to bulk RNA data and proteomic data.

Time outline

Structure of workshop:

Activity
Part 1: Introduction to R and RStudio
Part 2: Quality control of scRNA data
Part 3: DE analysis
Part 4: Pathway analysis
Part 5: Cell segmentation with BIDCell

Quality Control

Load data & libraries

This data is from a study using high-throughput single-cell mRNA sequencing technique, (Chromium Single Cell Gene Expression Solution - 10x Genomics) of mouse L4 Dorsal Root Ganglia (DRG) in uninjured and injured conditions (Dorsal root crush).

library(GEOquery)
library(SingleCellExperiment)
library(limma)
library(Seurat)
library(scater)
library(clusterProfiler) 
library(org.Mm.eg.db)
library(enrichplot)
library(GOSemSim)

sce <- readRDS("../data/avraham_sce.rds")
dim(sce)
## [1] 18654  5690
table(sce$condition)
## 
## control injured 
##    3365    2325

Mitochondrial genes

There are many ways to assess and improve the quality of our data. We can check the proportion of mitochondrial gene expression, remove lowly expressed genes, and remove cells with low/high counts. Below are the mitochondrial genes of interest. High mitochondrial gene expression suggests dead cells or duplicates, so it is important to check this early on in the analysis workflow.

mt_genes <- grep("^mt", rownames(sce))

# Print out mitochondrial genes
print(rownames(sce)[mt_genes])
##  [1] "mt-Nd1"  "mt-Nd2"  "mt-Co1"  "mt-Co2"  "mt-Atp8" "mt-Atp6" "mt-Co3" 
##  [8] "mt-Nd3"  "mt-Nd4l" "mt-Nd4"  "mt-Nd5"  "mt-Nd6"  "mt-Cytb"
# Calculate total counts for mitochondrial genes
mt_counts <- colSums(counts(sce)[mt_genes,])
# Calculate library size for each cell
library_size <- colSums(counts(sce))
# Divide mito counts by library size to get proportion
prop_mt_genes <- mt_counts/library_size

par(mfrow = c(1, 2))
boxplot(prop_mt_genes, main = "Proportion of mito genes")
hist(prop_mt_genes, main = " ")

Low expressed genes

Lowly expressed genes are often noisy with minimal biological signal and can affect statistical power as they have high variability with low counts. Here, we remove genes with less than a total of 5 counts.

gene_expression <- rowSums(counts(sce))
summary(gene_expression)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      10     149    1415     705 1775893
boxplot(gene_expression, main = "Gene counts")

# Remove genes with zero expression
idx <- which(gene_expression < 10)
print(paste0("There are ", length(idx), " genes with less than a total of 5 counts"))
## [1] "There are 4570 genes with less than a total of 5 counts"
sce <- sce[-idx,]

Cell counts

Here we check if there are any cells with a very low number of counts, Seurat suggests removing cells with less than a total of 200 counts but it’s always good to have a look at the distribution.

cell_counts <- colSums(counts(sce))
summary(cell_counts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1107    2935    4114    4636    5893   18286
hist(cell_counts)

Normalisation

Once we are happy with our QC we can normalise the data. Here are two boxplots where the first contains raw counts from 20 random cells and the second boxplot show the normalised counts for those same cells. We will use the scater package to perform counts per million normalisation using the logNormCounts function. The raw counts are stored in the counts() slot while the CPM normalised data is stored in the logcounts() slot of the SingleCellExperiment object.

sce <- logNormCounts(sce)
cells <- sample(colnames(sce), 20)
par(mfrow = c(1,2))
boxplot(counts(sce)[,cells], main = "Raw counts")
boxplot(logcounts(sce)[,cells], main = "Normalised counts")

## PCA Next we perform dimension reduction, in this case PCA, to visualise our data. Instead of using base R plotting functions, we will use the ggplot2 package which is more intuitive and commonly used for visualising data.

# Calculate PCs
sce <- runPCA(sce)
# Extract PXs
pca <- reducedDim(sce, "PCA")
pca <- as.data.frame(pca)
pca$condition <- sce$condition

ggplot(pca, aes(x = PC1, y = PC2, color = condition)) + geom_point()

DE Analysis

Limma

The goal of DE analyis is to identify genes that are differentially expressed between groups of interest: healthy vs diseased, young vs old, stages of parkinson’s disease etc… The most commonly used package to do this is called limma which constucts linear models and performs a moderated t-test to identify DE genes between groups.

We can see from the design matrix that the reference group (Intercept) is the uninjured group. When we construct the linear model using limma, we will be doing the following comparison: uninjured - injured. Thus, genes with positive logFC or test statistic will be more highly expressed in the uninjured group vs the injured. Genes with a negative logFC or test statistic will be more highly expressed in the injured group.

# Create design matrix
design_matrix <- model.matrix(~ as.factor(sce$condition))
colnames(design_matrix)
## [1] "(Intercept)"                     "as.factor(sce$condition)injured"
# You can also perform normalisation with voom and model the mean-variance trend
#y <- voom(counts(sce), plot = TRUE

# Fit linear model
fit <- lmFit(logcounts(sce), design = design_matrix)
fit <- eBayes(fit)
tt <- topTable(fit, coef = 2, n = Inf, adjust.method = "BH", sort.by = "p")

logFC density

plot(density(tt$logFC))

MA plot

plot(tt$AveExpr, tt$logFC)

Volcano plot

plot(tt$logFC, -log(tt$P.Value))

DE statistics

tt <- signif(tt, digits = 3)
DT::datatable(tt)

Pathway analysis

GO over-representation analysis

Here we select DE genes with an adjusted p-value < 0.05 and logFC > 0.5 and perform a GO over-representation analysis to identify any pathways that are significantly enriched with our genes of interest.

sig_genes <- rownames(tt[tt$adj.P.Val<0.05 & tt$logFC > 0.5,])
ego <- enrichGO(gene = sig_genes,
                keyType = "SYMBOL",
                OrgDb = org.Mm.eg.db,
                pvalueCutoff = 1,
                ont = "BP",
                pAdjustMethod = "BH",
                readable = TRUE)

tmp_ego <- ego[,c(1,2,3,5,6,7,8)]
tmp_ego[,4:6] <- round(tmp_ego[,4:6], digits = 4)
DT::datatable(tmp_ego)
# Dotplot
dotplot(ego)

# Tree plot
d <- godata(org.Mm.eg.db, ont = "BP")
ego_tree <- pairwise_termsim(ego, method = "Wang", semData = d)

suppressMessages(treeplot(ego_tree, offset = 30, font.size = 6, nCluster = 3, showCategory = 10))

GSEA

Unlike GO over-representation analysis, we do not subset our DE genes and instead use all genes output from limma including their -log10(p-values). Since GSEA is a rank based statistical test (Kolmogorov Smirnov test), the list of genes needs to be ordered from most significant to the least significant. Thus, we -log10 transform the adjusted p-values and pass these into the gseGO function.

sig_genes <- -log10(tt$adj.P.Val+1)
names(sig_genes) <- rownames(tt)
gse <- gseGO(geneList=sig_genes, 
             ont ="BP", 
             keyType = "SYMBOL", 
             pvalueCutoff = 0.05, 
             verbose = TRUE, 
             OrgDb = org.Mm.eg.db, 
             pAdjustMethod = "none",
             scoreType = "pos",
             nPermSimple = 10000)

DT::datatable(gse@result)
dotplot(gse)

gse <- pairwise_termsim(gse)
suppressMessages(treeplot(gse, showCategory = 20))